import pandas as pd
import numpy as np
The data relates to the direct marketing campaigns of a Portuguese banking institution.
The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no').
The goal is to analyse, based on the available data points, whether a client is likely to subscribe to a term deposit account.
df = pd.read_csv(r"C:\Users\$ubhajit\Documents\bank-additional\bank-additional/bank-additional-full.csv", sep=';')
Bank data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
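The notes on `duration` and `pdays` above imply two preprocessing steps: dropping `duration` for a realistic model, and treating 999 as a sentinel rather than a real number of days. The following is a minimal sketch on a small made-up frame. The rows are invented, and the helper column `previously_contacted` is a hypothetical name introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows that only mimic the column layout described above (values are made up).
toy = pd.DataFrame({
    "duration": [261, 0, 149],
    "pdays": [999, 6, 999],
    "y": ["no", "no", "yes"],
})

# 999 means "not previously contacted", so record that as a separate flag.
# `previously_contacted` is a hypothetical helper column, not part of the dataset.
toy["previously_contacted"] = (toy["pdays"] != 999).astype(int)

# Replace the sentinel with a missing value so it is not read as 999 days.
toy["pdays"] = toy["pdays"].replace(999, np.nan)

# `duration` is only known after the call, so leave it out of the feature set.
features = toy.drop(columns=["duration", "y"])
print(features)
```

Whether to keep `pdays` as a missing value, bin it, or rely on the flag alone is a modelling choice that this sketch does not settle.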
df.shape
Our dataset contains 41,188 records and 21 attributes.
Here is what our dataset looks like:
df.head(10)
We look at the statistical summary of all the variables present in our dataset.
df.describe(include = 'all').T
We observe the following:
Age ranges from 17 to 98, with an average of ~40.
There are 12 job categories in the dataset, and 'admin.' is the most common. One possible explanation is that administrative staff usually handle matters related to company bank accounts.
The most common education level is a university degree.
Most of the customers do not have credit in default.
Most of the customers (~82%) do not have a personal loan.
Most of the customers were contacted on a cellular phone.
~50% of the customers have a housing loan.
~86% of the customers are new to the bank's marketing campaigns.
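Percentages like the ones quoted above can be read off with `value_counts(normalize=True)`. This is a minimal sketch on a made-up series that stands in for the `loan` column, so the numbers it prints are not those of the real dataset.

```python
import pandas as pd

# Toy stand-in for the `loan` column (made-up values, not the real data).
loan = pd.Series(["no", "no", "yes", "no", "unknown"], name="loan")

# Share of each category, expressed as a percentage rounded to one decimal place.
loan_share = loan.value_counts(normalize=True).mul(100).round(1)
print(loan_share)
```

On the real data the same call would be applied to `df['loan']`, `df['housing']`, and so on.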
import plotly.express as px
Here we look at the age distribution of the targeted users.
df[['age']].describe().T
px.histogram(df , x = 'age')
We can see that age is right-skewed and that the average age of the targeted users is ~40.
We can also observe that most of the targeted customers are between 30 and 40 years old.
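The right skew read off the histogram can also be checked numerically: a positive sample skewness and a mean above the median both point to a longer right tail. This is a minimal sketch on made-up ages, not on the real column.

```python
import pandas as pd

# Made-up ages with a long right tail, used only to illustrate the check.
ages = pd.Series([25, 30, 32, 35, 38, 40, 45, 60, 85], name="age")

skewness = ages.skew()        # sample skewness; > 0 suggests a right skew
mean_age = ages.mean()
median_age = ages.median()

print(f"skewness={skewness:.2f}, mean={mean_age:.1f}, median={median_age:.1f}")
```

On the real data the same calls would be applied to `df['age']`.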
We now look at the distribution of job types.
df[['job']].describe().T
We can see 12 different job types, of which 'admin.' occurs most often.
px.histogram(df , x = 'job')
'admin.' and 'blue-collar' job holders are the largest groups among the targeted customers.
We also observe that the campaign reached unemployed customers and students as well.
We look at the distribution of marital status.
df[['marital']].describe().T
px.histogram(df , x = 'marital')
We observe that most of the targeted customers are married (~60%) or single (~28%).
We will look at the distribution of the education among the targeted customers.
df[['education']].describe().T
px.histogram(df , x = 'education')
Most of our targeted customers have either a university degree (~30%) or a high-school education (~23%). A very small number of customers (only 18) are illiterate.
Here we look at whether our targeted users are in default or have a personal or housing loan.
df[['default','housing','loan']].describe().T
px.histogram(df , x = 'default')
px.histogram(df , x='housing')
px.histogram(df , x='loan')
We can see that most of the targeted users do not have a personal loan and are not in default, while roughly half of them have a housing loan.
We create dummy variables for the following categorical columns: marital, job, education and poutcome.
df1 = pd.get_dummies(df , columns=['marital','job','education','poutcome'])
df1.head()
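To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a three-row made-up frame. It shows how one categorical column is replaced by one indicator column per category, named `<column>_<category>`.

```python
import pandas as pd

# Made-up rows; only the `marital` column is encoded here.
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["single", "married", "divorced"],
})

# Each category of `marital` becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["marital"])
print(encoded.columns.tolist())
```

Depending on the pandas version, the indicator columns are stored as booleans or as 0/1 integers, and either form works for the steps that follow.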
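For the other variables (default, housing, loan and contact) and for the target y, we encode 'yes' (or 'cellular' for contact) as 1 and every other value, including 'unknown', as 0: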
df1['default'] = np.where(df1['default'] == 'yes', 1, 0)
df1['housing'] = np.where(df1['housing'] == 'yes', 1, 0)
df1['loan'] = np.where(df1['loan'] == 'yes', 1, 0)
df1['contact'] = np.where(df1['contact'] == 'cellular', 1, 0)
df1.head()
df1['y'] = np.where(df1['y'] == 'yes',1 , 0)
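One consequence of the `np.where` encoding is that 'unknown' ends up in the same bucket as 'no'. The sketch below, on made-up values, shows that behaviour next to an alternative that keeps 'unknown' as a missing value. The alternative and the column names `housing_flag` and `housing_tri` are assumptions added for illustration, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for the `housing` column.
toy = pd.DataFrame({"housing": ["yes", "no", "unknown", "yes"]})

# Same rule as in the notebook: 'yes' -> 1, everything else (including 'unknown') -> 0.
toy["housing_flag"] = np.where(toy["housing"] == "yes", 1, 0)

# Hypothetical alternative: values absent from the mapping ('unknown') become NaN.
toy["housing_tri"] = toy["housing"].map({"yes": 1, "no": 0})

print(toy)
```

Which variant is preferable depends on whether 'unknown' should count as the absence of a loan or as missing information.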
import seaborn as sns
We now look at the dataset from a two-variable (bivariate) perspective.
First, we look at correlations.
import matplotlib.pyplot as plt
fig = plt.figure(figsize = (25,15))
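# df1 still holds text columns ('month', 'day_of_week'); restrict the correlation
# to numeric columns, since DataFrame.corr() raises on text columns in pandas 2.x.
sns.heatmap(df1.corr(numeric_only=True), cmap='RdBu')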
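A large heatmap can be hard to read, so it may help to pull out only the correlations with the target and sort them. This is a minimal sketch on a made-up frame. It assumes pandas 1.5 or later for the `numeric_only` argument, and the values are invented.

```python
import pandas as pd

# Made-up rows; `month` is left as text on purpose to show it is skipped.
toy = pd.DataFrame({
    "age": [25, 40, 58, 33, 47, 61],
    "housing": [1, 0, 1, 0, 1, 0],
    "month": ["may", "jun", "may", "jul", "aug", "may"],
    "y": [0, 0, 1, 0, 1, 1],
})

# Pearson correlation over numeric columns only.
corr = toy.corr(numeric_only=True)

# Correlation of every feature with the target, strongest positive first.
target_corr = corr["y"].drop("y").sort_values(ascending=False)
print(target_corr)
```

Pearson correlation only captures linear association, so a low value here does not rule out a non-linear relationship with `y`.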